So far in this class, we've only done supervised learning: we have a response variable, and we observe its value for all (or some) of our observations.
Clustering is a type of unsupervised learning: we want to sort our observations into clusters based on the predictors, but we have no preconceived notion of what those clusters represent!
The general goal of clustering is to make clusters such that points within a cluster are closer to each other than to the points outside the cluster.
What is our definition of close?
How many clusters do we think exist?
What algorithm do we use to select the clusters?
Idea: Choose initial centroids, then iterate two steps until convergence: (1) assign each observation to its nearest centroid; (2) recompute each centroid as the mean of its assigned observations.
There are many ways to choose initial centroids!
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
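As a sketch of that iterative idea, here is a from-scratch version on simulated 2-D data. Everything below is a toy illustration of the algorithm, not how kmeans() is implemented internally:

```r
# A from-scratch sketch of the iterative idea (Lloyd's algorithm) on toy
# 2-D data with two well-separated groups.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
k <- 2
centers <- x[sample(nrow(x), k), ]  # random observations as initial centroids

for (iter in 1:100) {
  # Step 1: assign each point to its nearest centroid
  d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
  assignments <- apply(d, 1, which.min)
  # Step 2: recompute each centroid as the mean of its assigned points
  new_centers <- t(sapply(1:k, function(j) {
    colMeans(x[assignments == j, , drop = FALSE])
  }))
  if (isTRUE(all.equal(centers, new_centers))) break  # converged
  centers <- new_centers
}
table(assignments)
```

The convergence check stops the loop once an update leaves the centroids unchanged, which is exactly when the cluster assignments have stabilized.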
The kmeans() function must be given a numeric matrix (or data frame) and a number of clusters.
It returns the centroids, the cluster assignments, and the within-cluster sums of squares.
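In code, a call might look like the following; `word_props` is a hypothetical stand-in for a data frame of word proportions (one row per document), not the actual course data:

```r
# Hypothetical stand-in for the word-proportion data (one row per document)
set.seed(1)
word_props <- as.data.frame(matrix(runif(70 * 10), nrow = 70))

# nstart re-runs the algorithm from several random initializations and
# keeps the best result
km <- kmeans(word_props, centers = 3, nstart = 25)

km$centers        # centroids: one row of means per cluster
km$cluster        # cluster assignment for each observation
km$tot.withinss   # total within-cluster sum of squares
```

Setting nstart > 1 is a cheap way to guard against the algorithm's sensitivity to where the initial centroids land.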
a do is or this all down it
1 0.01133 0.0005557 0.006617 0.011255 0.004472 0.002808 0.00000000 0.01659
2 0.02264 0.0004926 0.011389 0.006299 0.007694 0.003692 0.00006095 0.01290
3 0.01991 0.0003927 0.011035 0.006990 0.007490 0.003988 0.00020684 0.01388
our to also even its shall up an
1 0.0044983 0.03462 0.0013692 0.0005221 0.002432 0.001175 0.00000000 0.002015
2 0.0020956 0.03882 0.0003545 0.0009570 0.003617 0.001507 0.00037792 0.005632
3 0.0007316 0.03991 0.0004887 0.0008273 0.003761 0.001594 0.00006386 0.005143
every may should upon and for. more so
1 0.0004988 0.003866 0.002959 0.0001198 0.04885 0.006796 0.005968 0.003247
2 0.0018063 0.004356 0.001963 0.0027342 0.02559 0.006711 0.002886 0.002195
3 0.0015005 0.004252 0.002164 0.0024354 0.02481 0.006119 0.002760 0.001935
was any from must some were are had
1 0.001699 0.002560 0.006205 0.001444 0.0015569 0.001991 0.005777 0.001105
2 0.001539 0.003022 0.005779 0.002555 0.0014136 0.001507 0.005327 0.001454
3 0.001450 0.002925 0.004930 0.001991 0.0009518 0.001211 0.004672 0.001243
my such what as has no than when
1 0.0003618 0.003601 0.001148 0.011498 0.001965 0.001162 0.004317 0.0016872
2 0.0003375 0.002146 0.001287 0.008645 0.003715 0.002593 0.003044 0.0010076
3 0.0001322 0.002073 0.001191 0.008948 0.002657 0.002629 0.002846 0.0008404
at have not that which be her now
1 0.002607 0.005895 0.007625 0.01751 0.006758 0.01880 0.0010409 0.0004457
2 0.003359 0.007324 0.006319 0.01493 0.011130 0.01995 0.0008013 0.0004449
3 0.002745 0.006013 0.006159 0.01470 0.010791 0.02168 0.0001255 0.0004326
the who been his of their will but
1 0.06104 0.003742 0.001825 0.0006076 0.04376 0.009810 0.008446 0.004781
2 0.08707 0.002502 0.004712 0.0021895 0.06232 0.005920 0.006149 0.003631
3 0.10641 0.001867 0.003792 0.0017050 0.06602 0.004578 0.006730 0.003198
if. on then with by in. one there
1 0.004054 0.005086 0.0005555 0.007253 0.009479 0.01921 0.005726 0.001024
2 0.003052 0.004274 0.0003348 0.005290 0.008641 0.02383 0.002618 0.002865
3 0.003313 0.004203 0.0004346 0.005784 0.007441 0.02323 0.002792 0.002621
would can into only things your
1 0.009061 0.002233 0.003068 0.003100 0.00008643 0.0004461
2 0.007405 0.002344 0.001717 0.001319 0.00024624 0.0002759
3 0.007842 0.002768 0.001345 0.001586 0.00018331 0.0000000
Most of the time, when you see k-means clustering "in the wild", it will look like the base-R output above.
However, we are working on a new package, tidyclust, that does unsupervised learning in the same style as tidymodels!
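A sketch of a workflow that could produce output like the trained-workflow printout below, assuming the tidyclust package; `papers` and its `Author` column are hypothetical stand-ins for the word-proportion data:

```r
library(tidymodels)
library(tidyclust)

# Hypothetical stand-in: one row per document, word proportions plus Author
set.seed(1)
papers <- data.frame(Author = rep(c("AH", "JJ", "JM"), length.out = 70),
                     matrix(runif(70 * 5), nrow = 70))

# Recipe drops the non-predictor column; model spec asks for 3 clusters
kmeans_wf <- workflow() |>
  add_recipe(recipe(~ ., data = papers) |> step_rm(Author)) |>
  add_model(k_means(num_clusters = 3))

kmeans_fit <- fit(kmeans_wf, data = papers)

extract_centroids(kmeans_fit)           # cluster means, one row per cluster
predict(kmeans_fit, new_data = papers)  # .cluster assignment for each document
```

The recipe + model-spec + workflow pattern is the same one used for supervised models in tidymodels; only the model spec (k_means()) is new.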
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: k_means()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_rm()
── Model ───────────────────────────────────────────────────────────────────────
K-means clustering with 3 clusters of sizes 39, 5, 26
Cluster means:
a do is or this all down it
1 0.02264 0.0004926 0.011389 0.006299 0.007694 0.003692 0.00006095 0.01290
2 0.01133 0.0005557 0.006617 0.011255 0.004472 0.002808 0.00000000 0.01659
3 0.01991 0.0003927 0.011035 0.006990 0.007490 0.003988 0.00020684 0.01388
our to also even its shall up an
1 0.0020956 0.03882 0.0003545 0.0009570 0.003617 0.001507 0.00037792 0.005632
2 0.0044983 0.03462 0.0013692 0.0005221 0.002432 0.001175 0.00000000 0.002015
3 0.0007316 0.03991 0.0004887 0.0008273 0.003761 0.001594 0.00006386 0.005143
every may should upon and for. more so
1 0.0018063 0.004356 0.001963 0.0027342 0.02559 0.006711 0.002886 0.002195
2 0.0004988 0.003866 0.002959 0.0001198 0.04885 0.006796 0.005968 0.003247
3 0.0015005 0.004252 0.002164 0.0024354 0.02481 0.006119 0.002760 0.001935
was any from must some were are had
1 0.001539 0.003022 0.005779 0.002555 0.0014136 0.001507 0.005327 0.001454
2 0.001699 0.002560 0.006205 0.001444 0.0015569 0.001991 0.005777 0.001105
3 0.001450 0.002925 0.004930 0.001991 0.0009518 0.001211 0.004672 0.001243
my such what as has no than when
1 0.0003375 0.002146 0.001287 0.008645 0.003715 0.002593 0.003044 0.0010076
2 0.0003618 0.003601 0.001148 0.011498 0.001965 0.001162 0.004317 0.0016872
3 0.0001322 0.002073 0.001191 0.008948 0.002657 0.002629 0.002846 0.0008404
at have not that which be her now
1 0.003359 0.007324 0.006319 0.01493 0.011130 0.01995 0.0008013 0.0004449
2 0.002607 0.005895 0.007625 0.01751 0.006758 0.01880 0.0010409 0.0004457
3 0.002745 0.006013 0.006159 0.01470 0.010791 0.02168 0.0001255 0.0004326
the who been his of their will but
1 0.08707 0.002502 0.004712 0.0021895 0.06232 0.005920 0.006149 0.003631
2 0.06104 0.003742 0.001825 0.0006076 0.04376 0.009810 0.008446 0.004781
3 0.10641 0.001867 0.003792 0.0017050 0.06602 0.004578 0.006730 0.003198
if. on then with by in. one there
1 0.003052 0.004274 0.0003348 0.005290 0.008641 0.02383 0.002618 0.002865
2 0.004054 0.005086 0.0005555 0.007253 0.009479 0.01921 0.005726 0.001024
3 0.003313 0.004203 0.0004346 0.005784 0.007441 0.02323 0.002792 0.002621
would can into only things your
1 0.007405 0.002344 0.001717 0.001319 0.00024624 0.0002759
2 0.009061 0.002233 0.003068 0.003100 0.00008643 0.0004461
3 0.007842 0.002768 0.001345 0.001586 0.00018331 0.0000000
Clustering vector:
[1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 1 1 3 1 1 1 3 3 3 1 1 3 1 1 1 1 1 1 3 1 1
[39] 1 3 3 3 3 3 1 1 3 3 2 3 3 3 1 3 1 3 1 1 3 3 1 1 3 1 3 3 3 1 1 1
Within cluster sum of squares by cluster:
[1] 0.017125 0.002869 0.013263
(between_SS / total_SS = 35.0 %)
Available components:
...
and 2 more lines.
# A tibble: 3 × 71
.cluster a do is or this all down it our
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cluster… 0.0226 4.93e-4 0.0114 0.00630 0.00769 0.00369 6.10e-5 0.0129 2.10e-3
2 Cluster… 0.0113 5.56e-4 0.00662 0.0113 0.00447 0.00281 0 0.0166 4.50e-3
3 Cluster… 0.0199 3.93e-4 0.0110 0.00699 0.00749 0.00399 2.07e-4 0.0139 7.32e-4
# ℹ 61 more variables: to <dbl>, also <dbl>, even <dbl>, its <dbl>,
# shall <dbl>, up <dbl>, an <dbl>, every <dbl>, may <dbl>, should <dbl>,
# upon <dbl>, and <dbl>, for. <dbl>, more <dbl>, so <dbl>, was <dbl>,
# any <dbl>, from <dbl>, must <dbl>, some <dbl>, were <dbl>, are <dbl>,
# had <dbl>, my <dbl>, such <dbl>, what <dbl>, as <dbl>, has <dbl>, no <dbl>,
# than <dbl>, when <dbl>, at <dbl>, have <dbl>, not <dbl>, that <dbl>,
# which <dbl>, be <dbl>, her <dbl>, now <dbl>, the <dbl>, who <dbl>, …
# A tibble: 70 × 1
.cluster
<fct>
1 Cluster_1
2 Cluster_2
3 Cluster_2
4 Cluster_2
5 Cluster_2
6 Cluster_1
7 Cluster_1
8 Cluster_1
9 Cluster_1
10 Cluster_1
# ℹ 60 more rows
clusters Author n
1 Cluster_1 AH 31
2 Cluster_1 JM 8
3 Cluster_2 JJ 5
4 Cluster_3 AH 20
5 Cluster_3 JM 6
Did we really need all 200 variables to find those clusters?
Did we maybe "muddy the waters" by weighting all variables equally?
It is very common to do a PCA reduction before running K-means!
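One way this is often done (a sketch, with illustrative names): project the data onto its first few principal components with prcomp(), then cluster the scores:

```r
# Sketch: reduce dimension with PCA, then run k-means on the leading PCs.
# All names here are illustrative, not from the original analysis.
set.seed(1)
x <- matrix(rnorm(70 * 20), nrow = 70)  # stand-in for the word-frequency matrix

pca <- prcomp(x, scale. = TRUE)  # scaling keeps any one variable from dominating
scores <- pca$x[, 1:3]           # keep only the first few principal components

km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster)
```

Clustering the scores instead of the raw variables addresses both questions above: the PCs concentrate the signal into a few dimensions, and scaling before PCA stops high-variance variables from dominating the distance calculations.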
Three Centroids
# A tibble: 5 × 3
clust auth n
<int> <chr> <int>
1 1 JJ 4
2 2 AH 43
3 3 AH 8
4 3 JJ 1
5 3 JM 14
Four Centroids
# A tibble: 7 × 3
clust auth n
<int> <chr> <int>
1 1 AH 3
2 1 JJ 1
3 1 JM 13
4 2 AH 22
5 3 AH 26
6 3 JM 1
7 4 JJ 4
clusters Author n
1 Cluster_1 AH 31
2 Cluster_1 JM 8
3 Cluster_2 JJ 5
4 Cluster_3 AH 20
5 Cluster_3 JM 6
Pros:
Simple algorithm, easy to understand
Plays nice with PCA
SUPER fast to compute
Cons:
Very sensitive to location of initial centroids
User has to pick the number of clusters
Now, refer back to your PCA analysis of the cannabis data (from Monday).
Hierarchical clustering (also called agglomerative clustering)
Idea: Start with every observation in its own cluster, then repeatedly merge the two closest clusters.
Merging Observations
The hclust() function must be given a distance matrix (for example, from dist()).
When you give hclust() a distance matrix, it builds a dendrogram by successively merging the closest clusters.
To assign cluster labels, we can choose how many clusters (k) we want and use cutree() to cut the dendrogram into that number of clusters.
[1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
[39] 1 3 3 3 3 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Or we can choose a height cutoff (h) for the dendrogram
[1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 1 1 3 3 3 1 1 3 1 1 1 1 1 1 4 1 1
[39] 3 4 4 4 4 4 1 1 1 1 2 3 3 3 1 3 3 1 1 1 3 3 1 3 1 1 3 1 3 1 3 3
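Both ways of cutting the dendrogram can be sketched with base R's dist(), hclust(), and cutree(); the data here is a toy stand-in:

```r
# Toy stand-in for the real data: 70 observations, 5 numeric variables
set.seed(1)
x <- matrix(rnorm(70 * 5), nrow = 70)

d  <- dist(x)     # hclust() requires a distance matrix
hc <- hclust(d)

plot(hc)          # draw the dendrogram

clusters_k <- cutree(hc, k = 3)  # cut into exactly 3 clusters
clusters_h <- cutree(hc, h = 2)  # or cut at dendrogram height 2
```

Cutting at a height h lets the data decide how many clusters exist at that level of dissimilarity, while cutting at k forces a pre-chosen count, as in k-means.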
Pros:
Fast computation for moderate-sized data
Gives back full information in dendrogram form
Cons: